Complex models for genetic sequence data
PhD Thesis
In this thesis, the aim is to develop biologically motivated Bayesian models in two areas:
molecular phylogenetics and time-series metagenomics. In molecular phylogenetics, the
goal is generally to learn about the evolutionary history of a collection of species using
molecular sequence data, for example, DNA. Evolutionary history is represented graphically using evolutionary trees, where the root of a tree represents the most recent common
ancestor of all species in the tree. Substitutions in sequences are modelled through a continuous-time Markov process, characterised by an instantaneous rate matrix; standard models assume that this process is stationary and time-reversible. These assumptions are biologically
questionable and induce a likelihood function which is invariant to a tree’s root position.
This is detrimental to inference, since a tree’s biological interpretation depends on where it
is rooted. By relaxing both assumptions, we introduce two new models whose likelihoods
can distinguish between rooted trees. These models are non-stationary, with step changes
in the rate matrix on each branch. Each rate matrix belongs to a non-reversible family
of Lie Markov models, which are closed under matrix multiplication. The two models
differ in that a different non-reversible Lie Markov model is used in each. We perform our
analysis in the Bayesian framework using Markov chain Monte Carlo methods. We assess
the performance of our models using a simulation study, before considering an application
to a Drosophila data set, where most models fail to identify a plausible root position.
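
The root-invariance property described above can be checked numerically on a toy example. The sketch below is not one of the thesis's Lie Markov models: it assumes a hypothetical four-state cyclic rate matrix with a uniform stationary distribution, and it relaxes only reversibility (the process stays stationary, whereas the thesis relaxes both assumptions). All function names are illustrative. For a two-taxon tree with total path length 1, it evaluates the likelihood of one site pattern while the root is moved along the path.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(Q, t, squarings=10, terms=10):
    """Approximate exp(Q * t) by scaling and squaring with a Taylor series."""
    n = len(Q)
    A = [[Q[i][j] * t / 2 ** squarings for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, A)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = matmul(P, P)
    return P


def cyclic_rate_matrix(a, b, n=4):
    """Toy rate matrix (an assumption for illustration): rate a for i -> i+1
    and rate b for i -> i-1 (mod n). Rows and columns sum to zero, so the
    uniform distribution is stationary; the model is reversible only if a == b."""
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][(i + 1) % n] += a
        Q[i][(i - 1) % n] += b
        Q[i][i] = -(a + b)
    return Q


def is_reversible(Q, pi, tol=1e-12):
    """Detailed balance check: pi_i * Q_ij == pi_j * Q_ji for all i, j."""
    n = len(Q)
    return all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < tol
               for i in range(n) for j in range(n))


def site_likelihood(Q, pi, x, y, t1, t2):
    """Likelihood of states x and y at the two tips of a rooted two-taxon
    tree with branch lengths t1 and t2 and root distribution pi."""
    P1, P2 = expm(Q, t1), expm(Q, t2)
    return sum(pi[r] * P1[r][x] * P2[r][y] for r in range(len(pi)))


pi = [0.25] * 4
models = {
    "reversible": cyclic_rate_matrix(0.6, 0.6),
    "non-reversible": cyclic_rate_matrix(1.0, 0.2),
}
for name, Q in models.items():
    # Slide the root along a path of total length 1.
    values = [site_likelihood(Q, pi, 0, 1, t1, 1.0 - t1) for t1 in (0.1, 0.5, 0.9)]
    print(name, is_reversible(Q, pi), [round(v, 6) for v in values])
```

Under the reversible matrix the three likelihood values agree, so the data carry no information about the root; under the non-reversible matrix they differ, which is what allows a likelihood to distinguish between rooted trees.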
In time-series metagenomics, counts of operational taxonomic units (OTUs), which
are pragmatic proxies for microbial species, are modelled over time. We have weekly
counts of different OTUs from two tanks in a wastewater treatment plant. We develop
a Bayesian hierarchical vector autoregressive model to model the dynamics of the OTUs,
whilst also incorporating environmental and chemical data. Clustering methods are explored to reduce the dimensionality of our data and mitigate the issue of large proportions
of zero counts in the data. We use a seasonal phase-based clustering approach and a
symmetric, circulant, tri-diagonal error structure. The autoregressive coefficient matrix is
assumed to be sparse, so we explore different priors that allow for sparsity by analysing
simulated data sets before selecting the regularised horseshoe prior for our hierarchical
model. The chemical and environmental covariates are incorporated through a time-varying mean. Finally, we fit the model to the data from each tank using Hamiltonian Monte
Carlo.
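
The model structure summarised above can be illustrated with a small simulation. This is a minimal sketch under several assumptions that the abstract does not confirm: the series are treated as continuous (for example, transformed counts), the symmetric, circulant, tri-diagonal structure is applied to the error covariance with clusters ordered around the seasonal cycle, and the autoregression acts on deviations from the time-varying mean. The functions, the coefficient matrix `A`, and the sinusoidal mean are all hypothetical, and no prior or Hamiltonian Monte Carlo fitting is shown.

```python
import math
import random


def circulant_tridiagonal(k, diag, off):
    """Symmetric circulant tri-diagonal k x k matrix (k >= 3): `diag` on the
    diagonal, `off` between neighbours, with index 0 and k-1 also neighbours."""
    S = [[0.0] * k for _ in range(k)]
    for i in range(k):
        S[i][i] = diag
        S[i][(i + 1) % k] = off
        S[i][(i - 1) % k] = off
    return S


def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive-definite S."""
    k = len(S)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def simulate_var1(A, mean_at, L, n_steps, rng):
    """Simulate y_t = mu_t + A (y_{t-1} - mu_{t-1}) + e_t, e_t ~ N(0, L L^T)."""
    k = len(A)
    y = list(mean_at(0))
    path = [y]
    for t in range(1, n_steps):
        mu_prev, mu_now = mean_at(t - 1), mean_at(t)
        dev = [y[i] - mu_prev[i] for i in range(k)]
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]
        e = [sum(L[i][m] * z[m] for m in range(i + 1)) for i in range(k)]
        y = [mu_now[i] + sum(A[i][j] * dev[j] for j in range(k)) + e[i]
             for i in range(k)]
        path.append(y)
    return path


k = 4  # hypothetical number of phase-based clusters
# Sparse autoregressive coefficients: mostly zeros, as the abstract assumes.
A = [[0.5, 0.0, 0.0, 0.0],
     [0.0, 0.4, 0.2, 0.0],
     [0.0, 0.0, 0.3, 0.0],
     [0.1, 0.0, 0.0, 0.6]]


def mean_at(t):
    """Hypothetical time-varying mean: an annual cycle on weekly data, with
    each cluster peaking at a different phase."""
    return [math.sin(2 * math.pi * (t / 52 - i / k)) for i in range(k)]


Sigma = circulant_tridiagonal(k, 1.0, 0.3)
path = simulate_var1(A, mean_at, cholesky(Sigma), 104, random.Random(1))
print(len(path), [round(v, 3) for v in path[-1]])
```

With this ordering, only clusters at adjacent seasonal phases share error covariance, and the wrap-around entries link the last cluster back to the first.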